Ensemble Optical Character Recognition Systems via Machine Learning

Authors

  • Zifei Shan
  • Haowen Cao
Abstract

Optical Character Recognition (OCR) systems are widely used to convert scanned text into text usable by computers. We observe that current OCR systems perform poorly on domain-specific papers, producing many incorrect words; moreover, different OCR systems make relatively independent mistakes. Based on these observations, we build an ensemble of multiple open-source OCR systems that chooses among the candidate outputs generated by each OCR system, and we train it with machine learning techniques. We implement Softmax Regression and multi-class SVM. Our system achieves over 80% accuracy in selecting between different outputs on our training set of 1,011 words. We further explore ways to improve performance by suggesting new options and by using domain knowledge. Our contribution lies in the following aspects: (1) We show the great potential of treating OCR systems as black boxes and correcting their outputs against each other. (2) Our system builds on the best open-source OCR systems and achieves a significant improvement over their accuracy. (3) Our work explores the possibility of using rich semantic knowledge to craft a better OCR system, and offers insight into a general approach to ensembling systems treated as black boxes.
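The selection step described in the abstract can be sketched roughly as follows: a softmax regression model scores each OCR engine's candidate for a word and picks the highest-scoring one. This is a minimal illustration under stated assumptions, not the authors' implementation. The abstract does not specify the features used, so the features here (lexicon membership and agreement between engines), the three unnamed engines, the toy lexicon, and the training words are all hypothetical.

```python
import math

# Hypothetical domain lexicon; the abstract does not describe one.
LEXICON = {"protein", "genome", "fossil", "strata", "basalt"}

def features(candidates, lexicon):
    """Assumed features: a bias term, then for each engine's candidate
    an in-lexicon flag and the fraction of other engines that agree."""
    n = len(candidates)
    feats = [1.0]
    for c in candidates:
        feats.append(1.0 if c.lower() in lexicon else 0.0)
        agree = sum(1 for o in candidates if o == c) - 1
        feats.append(agree / max(n - 1, 1))
    return feats

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))

def train(X, y, n_classes, lr=0.5, epochs=500):
    """Softmax regression fitted by stochastic gradient descent.
    Class k means 'trust the candidate from engine k'."""
    W = [[0.0] * len(X[0]) for _ in range(n_classes)]
    for _ in range(epochs):
        for x, label in zip(X, y):
            probs = softmax([dot(w, x) for w in W])
            for k in range(n_classes):
                grad = probs[k] - (1.0 if k == label else 0.0)
                for j, xj in enumerate(x):
                    W[k][j] -= lr * grad * xj
    return W

def choose(W, candidates, lexicon):
    """Return the candidate from the engine with the highest score."""
    x = features(candidates, lexicon)
    scores = [dot(w, x) for w in W]
    return candidates[scores.index(max(scores))]

# Toy training set: (candidates from three engines, index of a correct one).
TRAIN = [
    (["protein", "pr0tein", "protcin"], 0),
    (["gen0me", "genome", "genorne"], 1),
    (["fossi1", "fosil", "fossil"], 2),
    (["strata", "strata", "stnata"], 0),
]
X = [features(c, LEXICON) for c, _ in TRAIN]
y = [label for _, label in TRAIN]
W = train(X, y, n_classes=3)

print(choose(W, ["basa1t", "basalt", "basait"], LEXICON))
```

A multi-class SVM, the other model named in the abstract, could be trained on the same feature vectors; only the loss function would change.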

Similar resources

Character Recognition using Ensemble classifier

To improve the accuracy of data classification systems, several techniques using classifier fusion have been suggested. This paper proposes a classifier fusion model for the character recognition problem. The work presented here aims to offset the disadvantages and exploit the benefits of different classifiers with varying feature sets. In particular, this approach proposes the use of statistical procedures ...

Soft Computing - Neural Networks Ensembles

A neural network ensemble is a learning paradigm in which a finite collection of neural networks is trained for the same task. It is understood that the generalization ability of neural networks can be improved by ensembling, i.e., training many neural networks and then combining their predictions. ANN ensemble techniques have become very popular amongst neural network practitioners in a variety of ANN application doma...

Combining Classifiers and Learning Mixture-of-Experts

Expert combination is a classic strategy that has been widely used in various problem-solving tasks. A team of individuals with diverse and complementary skills tackles a task jointly so that, by integrating the strengths of its members, it achieves better performance than any single individual could. Starting in the late 1980s in the handwritten character recognition literature, studies ...

Fault Detection of Anti-friction Bearing using Ensemble Machine Learning Methods

The anti-friction bearing (AFB) is a very important machine component, and its unscheduled failure causes malfunctions in a wide range of rotating machinery, resulting in unexpected downtime and economic loss. In this paper, ensemble machine learning techniques are demonstrated for the detection of different AFB faults. Initially, statistical features were extracted from temporal vibratio...

Recognition of Hand-Printed Characters via Induct-RDR

The goal of character recognition research is to simplify and automate the development of character recognition algorithms. We describe here an approach based on applying preprocessing to data sets of Latin characters and then applying a machine learning approach to the data sets to build a knowledge base able to classify unseen pre-processed characters. The machine learning method, Induct/RDR,...

Publication date: 2013